Speedup `Benchmarker.prepare` (`compute_connectivities_umap`) #128

jan-engelmann · 2023-12-20T23:18:55Z

Thanks for the great package!

Unfortunately, `Benchmarker.prepare()` is very slow because of the currently released scanpy code:
`sc.neighbors._compute_connectivities_umap` is highly inefficient (see this monster).

One iteration of the KNN calculation takes 2 h 30 min for 7 million cells. A large part of this time is spent in `sc.neighbors._compute_connectivities_umap`.

I see two solutions:
A Use the approach proposed in this PR
B Use new scanpy code not yet released but already on main

A This PR

The solution in this PR offers the following speedup:

B Unreleased Scanpy code

The neighbors code on scanpy's main branch has been thoroughly refactored by @Zethson and likely offers a way to do this efficiently. For example, `scanpy.neighbors._common._get_sparse_matrix_from_indices_distances` looks promising. I can look more into this if there's interest.
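
For context, the core of such a helper is turning the dense `(n_obs, n_neighbors)` KNN index and distance arrays into a sparse `n_obs × n_obs` matrix in one vectorized step. Below is a minimal sketch, assuming every row lists the same number of neighbors and contains no missing indices. The name `sparse_from_knn` is hypothetical, and this is not the scanpy implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_from_knn(
    knn_indices: np.ndarray, knn_dists: np.ndarray, n_obs: int
) -> csr_matrix:
    """Build a sparse (n_obs, n_obs) distance matrix from dense KNN arrays.

    Hypothetical helper for illustration; assumes every row has the same
    number of neighbors and no missing (-1) indices.
    """
    n_neighbors = knn_indices.shape[1]
    # Row i owns the slice [i * n_neighbors, (i + 1) * n_neighbors) of the data.
    indptr = np.arange(0, n_obs * n_neighbors + 1, n_neighbors)
    mat = csr_matrix(
        (knn_dists.ravel(), knn_indices.ravel(), indptr),
        shape=(n_obs, n_obs),
    )
    # Drop explicit zeros (e.g. a cell listed as its own neighbor at distance 0).
    mat.eliminate_zeros()
    return mat
```

Building the CSR arrays directly avoids any per-cell Python loop, which is what matters at millions of cells.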

What do you think, @Zethson @adamgayoso?

Zethson · 2023-12-21T10:49:09Z

I didn't touch the neighbors code. Maybe I merged a PR but that's about it.

I'd rather lobby for a new scanpy release and bring any improvements upstream into scanpy.

But cool work @jan-engelmann !

adamgayoso

Thanks for this! I think this should be added to the package so we don't rely on a private fn of scanpy. Also the fn is quite simple.

adamgayoso · 2023-12-27T14:19:45Z

src/scib_metrics/utils/_utils.py

+    knn_indices,
+    knn_dists,
+    n_obs,
+    n_neighbors,
+    set_op_mix_ratio=1.0,
+    local_connectivity=1.0,
+):


Please add typing

adamgayoso · 2023-12-27T14:20:14Z

src/scib_metrics/utils/_utils.py

+    set_op_mix_ratio=1.0,
+    local_connectivity=1.0,
+):
+    """Sped up version of sc.neighbors._compute_connectivities_umap."""


Can you put a more general docstring? An overview of the method, and a note that it matches how connectivities are computed in scanpy?

adamgayoso · 2023-12-27T14:20:38Z

src/scib_metrics/utils/_utils.py

+        X,
+        n_neighbors,
+        None,
+        None,


nit: I prefer using keywords everywhere

adamgayoso · 2023-12-27T14:23:19Z

tests/test_utils.py

+    assert (new_dist == sc_dist).todense().all()
+    assert (new_connect == sc_connect).todense().all()


nit: use np.testing
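
A minimal sketch of what that could look like, assuming both results are SciPy sparse matrices. The helper name `assert_sparse_allclose` is hypothetical, and densifying is only reasonable for small test inputs.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def assert_sparse_allclose(actual, desired, rtol: float = 1e-7) -> None:
    """Compare two sparse matrices via np.testing (hypothetical test helper)."""
    assert issparse(actual) and issparse(desired)
    # Densify first: np.testing works on arrays and reports mismatched entries.
    np.testing.assert_allclose(actual.toarray(), desired.toarray(), rtol=rtol)


# Two small matrices with identical entries pass silently.
a = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
b = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
assert_sparse_allclose(a, b)
```

Compared with `(x == y).todense().all()`, a failure here reports which entries differ and by how much.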

adamgayoso · 2023-12-27T14:25:16Z

tests/test_utils.py

+    assert (new_connect == sc_connect).todense().all()
+
+
+def test_timing_compute_connectivities_umap():


I don't think this test is necessary if you add a reproducible script to this PR description
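
Such a script could be as small as the following sketch. `time_call` is a hypothetical helper, and the sort is only a stand-in workload; a real script would time the old and new connectivity functions on the same KNN arrays.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload; replace with the functions being compared.
    elapsed = time_call(lambda: sorted(range(100_000), reverse=True))
    print(f"best of 3: {elapsed:.4f} s")
```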

adamgayoso · 2023-12-27T14:42:06Z

src/scib_metrics/utils/_utils.py

+    if isinstance(connectivities, tuple):
+        # In umap-learn 0.4, this returns (result, sigmas, rhos)
+        connectivities = connectivities[0]


Is this bit still necessary? What's the lower bound for scanpy? If we merge this PR, we will need to make umap a direct dependency, so please also add that and potentially remove this block.

adamgayoso · 2023-12-27T16:05:22Z

Actually, I would hold off on this PR. I see some redundancies in converting back and forth from sparse distance matrices that I'd like to address first. This should eliminate the need for building the sparse distance graph here.

adamgayoso · 2023-12-27T22:28:42Z

@jan-engelmann feel free to give a review to #129

adamgayoso · 2023-12-28T17:13:23Z

Closed due to #129

@jan-engelmann I will add you as an author on the PR

jan-engelmann · 2023-12-29T12:12:02Z

> But cool work @jan-engelmann !

thanks @Zethson :)

jan-engelmann · 2023-12-29T12:14:29Z

> Closed due to #129
>
> @jan-engelmann I will add you as an author on the PR

Thanks @adamgayoso I appreciate it! I like the refactoring. Very clean! 👍
Left a review on #129 but did not request any changes.

jan-engelmann · 2023-12-29T12:19:10Z

Also thanks for the review! @adamgayoso
Will keep these things in mind for the next time

speedup compute_connectivities_umap

a763236

adamgayoso reviewed Dec 27, 2023

adamgayoso closed this Dec 28, 2023
